ABSTRACT

Security on the web is a foremost necessity, and handling cybercrime is a priority. The growing popularity of social media has led children to
use the internet more for social communication than for information gathering. Children need to learn and grow with technology, but child safety is also required.
Pedophiles hunt for innocent children on social media and chat room platforms that are not safe for children. Due to a lack of parental guidance, such cases lead
to cybercrimes that children are not aware of. Social media is not the only area where pedophilic activity takes place; queries submitted to a search engine may also help in
detecting a pedophile. Here, the main idea is to identify pedophiles from the conversations they hold with a child, based on the pattern of words and
language used by the adult. In addition, pedophilic activity can be traced through detection of search engine queries.
INTRODUCTION

Protection of children in cyberspace is an extremely critical problem faced by our
society across geographical and cultural boundaries. As more and more children in
their teens have started using the Internet, there has been an alarming increase in
cases of child abuse through the Web.[1] According to a report published by the
National Center for Missing and Exploited Children (NCMEC), “1 out of 7 kids is
solicited for sex online; 1 out of 33 kids receives aggressive online solicitation to
meet in person; 1 out of 3 kids receives unsolicited sexual content online.”

The Internet nowadays provides predators and criminals with easy and convenient
access to children. Parents, on the other hand, often do not track how their children
are using the Internet. The lack of attention from parents and the criminal intentions
of some people give rise to cybercrime against children. Pedophiles are people with a
psychiatric disorder who are sexually attracted to prepubescent children.

Today, cybercrime such as pedophilic activity is a major issue of concern. This
activity is termed cyberpedophilia. Children are not safe on the internet, and there is
a dire need for an internet space that is safe for them. It has always been
recommended that parents monitor their children's activities on the internet: what
they post, what they see, whom they chat with and what kind of messages they receive.


The World Wide Web is an architectural framework for accessing linked documents
spread out over millions of machines all over the Internet. With Internet usage
gaining popularity and the number of users growing steadily, the World Wide Web has
become a huge repository of data and serves as an important platform for the
dissemination of information. The data available on the web is termed Web data, and
the combination of Data Mining and the World Wide Web is termed Web Mining: the
application of data mining techniques to automatically discover and extract useful
information and patterns from Web documents and services. The most commonly used
techniques are association rules, classification, clustering and sequential pattern
identification.


Web data usually takes the following forms: Web content, which includes text,
images, structured records, videos, audio files, etc.; Web structure, which includes
hyperlinks, document structure and tags; and Web usage, which includes web server
logs, application server logs and application level logs.


For any type of mining, the most important step is preprocessing. Data
preprocessing is the step in which the raw data is processed so that the extracted
data is useful for mining knowledge. The raw data passes through several levels of
processing: selecting the target data from the raw data, extracting the relevant data
and transforming the processed data to obtain knowledge. The processing includes
cleaning noisy data, data integration, data transformation, data reduction and data
discretization.


There are three types of web mining techniques, distinguished by the usage and the
type of knowledge to be mined and extracted: web usage mining, web structure
mining and web content mining.


Taxonomy of web mining: 

A. Web Structure Mining 

Web structure mining is the process of using graph theory to analyze the node and
connection structure of a web site. Preprocessing for this type of mining involves
identifying interesting graph patterns or processing the whole web graph to derive
matrices. The structure of a typical web graph consists of web pages as nodes and
hyperlinks as edges connecting two related pages. Such mining can be done at the
intra-page document level or at the inter-page hyperlink level. The most common
example is PageRank; other major uses include the HITS (Hubs and Authorities)
algorithm and Information Scent. Useful information such as the quality of a web
page, interesting web structures and web page classification can be obtained.
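
As an illustration of treating pages as nodes and hyperlinks as edges, the following is a minimal sketch of PageRank computed by power iteration. The four-page `web_graph`, the damping factor of 0.85 and the iteration count are assumptions chosen for the example and do not come from this work.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly over its outgoing hyperlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no outgoing links): spread its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page site used only for illustration.
web_graph = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home", "about"],
    "contact": ["home"],
}
ranks = pagerank(web_graph)
```

In this toy graph the page with the most incoming hyperlinks receives the highest score, which is the sense in which link structure can indicate the quality of a web page.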


B. Web Usage Mining 

Web usage mining is the process of applying data mining techniques to the
discovery of usage patterns from web data. The data available on the web is not only
huge but also semi-structured. The browsing history of the user is stored in log
files, such as server logs, proxy logs or browser logs, which can be used to mine
interesting patterns. These log files hold a lot of information such as URLs, IP
addresses, time, date, etc. When people visit a website, they leave behind data such
as IP address, pages visited, visiting time and so on; web usage mining collects,
analyzes and processes these logs and recorded data. This technique is widely used in
e-commerce, web transactions, path and pattern discovery, pattern analysis and
many more.
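
To make the log fields concrete, here is a small sketch that parses web server log lines and counts visits per page and per visitor. It assumes the widely used Common Log Format; the sample lines and IP addresses are invented for the example.

```python
import re
from collections import Counter

# Pattern for the Common Log Format (an assumption about the log layout).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def parse_log(lines):
    """Return one dict per well-formed log line; malformed lines are skipped."""
    records = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            records.append(match.groupdict())
    return records


# Made-up sample entries.
sample_log = [
    '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.1 - - [10/Oct/2023:13:56:01 +0000] "GET /products.html HTTP/1.1" 200 1045',
    '192.0.2.7 - - [10/Oct/2023:14:02:11 +0000] "GET /index.html HTTP/1.1" 200 2326',
    'malformed line',
]
records = parse_log(sample_log)
page_visits = Counter(r["url"] for r in records)        # visits per page
requests_per_ip = Counter(r["ip"] for r in records)     # requests per visitor
```

Counts like these are the raw material for the path and pattern discovery mentioned above.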
C. Web Content Mining

Web content mining is the process of discovering useful information from text,
image, audio or video data on the web. Information retrieval is the basic means
for any information gathering technique, helping the user find specific
information in a large set of data. This technique is mainly used for Natural
Language Processing and Information Retrieval. Due to the enormous size of the web,
a query can return many results, possibly with repetition. Thus, we need a
technique that reduces the results to the information that is appropriate for our
knowledge mining.


Fig. 1 Taxonomy of Web Mining (Web Structure Mining: document structure, intra-document hyperlink, inter-document hyperlink; Web Usage Mining: server logs, application server logs, application level logs)


Web search deals with information retrieval, a field of study concerned with
retrieving knowledge and information from a vast dataset, which in this case is most
likely web data.


Fig. 2 Web Content Mining (branches: Web Page Content Mining, Web Search Result Mining)


Web content mining efforts can be grouped into two subcategories:

(i) Agent-based approach: The agent-based approach uses so-called Web agents to
collect relevant information from the World Wide Web. A Web agent is a
program that visits a Web site and filters the information the user is interested in.
There are three subtypes of the agent-based approach: Intelligent Search
Agents, Information Filtering/Categorization and Personalized Web
Agents. For more information about these subtypes, see the cited literature.


(ii) Database approach: The database approach to Web mining tries to
develop techniques for organizing semi-structured data stored on the Web
into more structured collections of information resources. Standard
database querying mechanisms and data mining techniques can then be used to
analyze those collections.


Preprocessing of Web Data: 

The content data needs to be preprocessed before the actual knowledge mining 
process. The content is in an unstructured form and has many components that 
are not useful or important for our knowledge. 


Steps for preprocessing the content of Web data are as follows: 

1. Extract text from HTML 

Web data generally takes the form of HTML pages. The aim is to extract the data
available on the web, which may be text, audio, video, animations, etc.

Fig. 3 Content Preparation


The data that is useful for the mining process needs to be extracted so that it can
be processed and useful information obtained from it. This involves deciding what
type of data is required for our knowledge and how much of the whole web data needs
to be extracted.
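
A minimal sketch of this extraction step, using only the `html.parser` module from the Python standard library. The `TextExtractor` class, the `extract_text` helper and the sample page are illustrative names and data, not part of the original system; skipping `<script>` and `<style>` content is an assumption about which parts of a page are not useful.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the content of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up page used only for illustration.
sample_html = """<html><head><title>Demo</title><style>p {color: red;}</style></head>
<body><h1>Welcome</h1><p>Plain <b>text</b> only.</p><script>var x = 1;</script></body></html>"""
page_text = extract_text(sample_html)
```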


2. Perform Stemming 

Stemming is the process of reducing the words in the text to their word stem.
Many search engines treat a word and its word stem as synonyms to increase the
number of results for a search query. Words such as “fishing”, “fished” and
“fisher” all reduce to the stem “fish”.
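
The idea can be sketched with a deliberately naive suffix-stripping function. This is not the Porter algorithm or any stemmer used by the authors; the suffix list and the minimum stem length are assumptions made for the example, and an established stemmer would normally be preferred in practice.

```python
# Suffixes to strip, checked in this order (illustrative, far from complete).
SUFFIXES = ("ing", "ed", "er", "es", "s")


def simple_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


stems = [simple_stem(w) for w in ["fishing", "fished", "fisher", "fish"]]
```

All four inputs map to `fish`. A rule set this small makes mistakes on many other words (for example, it leaves a doubled consonant in place), which is why it should be read only as an illustration of the principle.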


3. Removing Stop Words 

Stop words are the most commonly used words in a language. They are typically
filtered out by natural language processing tools. There is no single universal
list of stop words. Words such as “I”, “you”, “the”, “it”, “what”, “as” etc.
are some of the common stop words.
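
A short sketch of stop-word removal. The `STOP_WORDS` set below contains the examples from the text plus a few extra common words added as an assumption; it is not an authoritative list.

```python
# Examples from the text plus a few assumed additions.
STOP_WORDS = {"i", "you", "the", "it", "what", "as", "a", "is", "to", "and"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words (case-insensitive match)."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "What is the name of the school you go to".split()
filtered = remove_stop_words(tokens)
```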
4. Calculate Collection Wise Word Frequencies (DF)

The text as a whole contains repeated words, so the unique words must be identified.
Here the text is taken to be the whole website, including all the webpages
linked to each other. The unique words are identified and extracted from the
web data, and because each of them may occur many times in the text, the
frequency of each unique word is calculated across the collection.


5. Calculate per Document Term Frequency (TF)
Here, only the text of a single webpage is considered. The frequency of each
unique word is calculated within that web page.
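
Steps 4 and 5 can be sketched together with `collections.Counter`. The three-page `pages` collection is invented for the example. The sketch computes the per-page term frequency, the collection-wide frequency described in step 4 and, as an assumption about what "DF" may also denote, the number of pages that contain each word.

```python
from collections import Counter

# Hypothetical collection: the preprocessed words of each web page.
pages = {
    "page1": ["fish", "river", "fish", "boat"],
    "page2": ["river", "bridge"],
    "page3": ["fish", "market"],
}

# Step 5: per-document term frequency (TF).
term_freq = {name: Counter(words) for name, words in pages.items()}

# Step 4: collection-wise frequency, i.e. total occurrences of each unique word.
collection_freq = Counter()
for counts in term_freq.values():
    collection_freq.update(counts)

# Document frequency in the usual sense: number of pages containing each word.
doc_freq = Counter(word for counts in term_freq.values() for word in counts)
```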


The objective is to preprocess the chat log data using these data mining
preprocessing techniques. The processed chat log helps to distinguish the users in
the conversation. A pedophile is identified based on the terminology and language
of an individual and on the usage and frequency of negative words. Using the same
negative words as a base, pedophilic activity can also be detected on a search
engine.


The advancement of Information and Communication Technology has led to various
innovations in our personal lives. Today chat servers are a vital part of life.
Messaging applications such as Facebook, Instant Messenger, Yahoo Messenger,
WhatsApp, etc. are very popular among the youth nowadays. However, these
messaging applications cannot easily be controlled or managed, and peer-to-peer
chat conversations are almost impossible to keep track of. The main issue with
such messaging systems is that it is difficult to know who the person on the other
side is. That person can be a predator targeting young children. This can only be
determined if we have the conversation and can see the way the predator builds a
relationship with the child.


Apart from conversations, predators can be tracked down with the help of search
engines. Predators tend to search for images, videos and other related content to
fulfill their needs. Such activity should also be controlled and verified. With the
enormous amount of data available on the internet, pedophiles can be more
active and obtain more information than needed. Data that is not safe for a child's
security should be restricted. These major issues need to be controlled, and
pedophiles should be kept isolated from such data.


Experiments: 

The problem touches many aspects of web data related to pedophilic activity.
Here, the focus is mainly on the parts related to search systems and chat logs.


A. Chat Logs 

Chat servers are widely used in day-to-day life. The data obtained from chat logs
contains plenty of information, and text mining is applied to this chat data.
Perverted Justice (PJ) is an American foundation that investigates cases of online
child sexual abuse. Adult volunteers enter chat rooms posing as juveniles (usually
12 to 15 years old), and if they are sexually solicited by adults, they work with
the police to prosecute the offenders. Some chat conversations with cyberpedophiles
are available at www.perverted-justice.com, and they have been the subject of
analysis in recent research on this topic.


The chat lines are mainly categorized into the following: 
1. Exchange of personal Information 

2. Grooming 

3. Approach 

4. None of the above classes 


Step 1: Exchange of personal information 
Pedophile: Hey beautiful 
Pedophile: What's your number 


Step 2: Grooming 
Pedophile: Yeah you need come hangout sometime soon my friends and yours 
Pedophile: Hmmm like what have you done privately


Step 3: Approach 

Predator: licking don't hurt 

Predator: it's like u lick ice cream 

Pseudo-victim: do u care that I'm 13 in March and not yet? I lied a little bit b4
Predator: it's all cool 


Step 4: None of the above 

Predator: don't tell anyone we have been talking 
Pseudo-victim: k 

Pseudo-victim: lol who would I tell? No one's here. 
Predator: well I want it to be our secret 


The idea is to determine whether or not a person is a pedophile based on the
language and writing style the person uses during an online chat conversation. The
chat logs are taken from the Perverted Justice website, where the data is available
for research purposes.
Figure 4 shows the chat log between Josh and Decoy that was available on the
PJ website. This data needs to be processed to gain knowledge. The categorization
of the chat into steps can be seen. The data then undergoes preprocessing because
of the huge amount of raw information. The chat log data is in an unstructured
form and needs to be formatted into a structured form, which makes it easier to
process according to the requirements of the system.
Fig. 5 Text Extraction


Fig. 5 shows the chat log data in a structured format, which is needed for further
processing. The idea is to separate the raw data and organize it in a meaningful
manner that is easily understandable and usable for the mining process.
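
One way to picture this structuring step is the sketch below, which turns raw chat text into records with a line index, a speaker and a message. It assumes each chat line has the shape `speaker: message`, as in the excerpts above; the actual layout of the source logs may differ. The `ChatLine` type, the function name and the sample lines are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChatLine:
    index: int
    speaker: str
    message: str


def structure_chat_log(raw_text):
    """Split raw chat text into ChatLine records, skipping unusable lines."""
    records = []
    for raw_line in raw_text.splitlines():
        # Split on the first colon only, so colons inside the message survive.
        speaker, sep, message = raw_line.partition(":")
        if not sep or not message.strip():
            continue  # blank line, or no "speaker: message" shape
        records.append(ChatLine(len(records) + 1, speaker.strip(), message.strip()))
    return records


# Invented, neutral sample lines.
raw_log = """UserA: hi there
UserB: hello

UserA: how was school today
not a chat line
"""
chat = structure_chat_log(raw_log)
```

Keeping the speaker as a separate field is what later allows word frequencies to be computed per participant.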
Fig. 6 Word Extraction

Fig. 6 shows the chat log conversation with the whole text split into individual
words. These individual words are then used in the information retrieval and
mining process. This is part of preprocessing the unstructured data into a simpler
form that is used in the knowledge-gaining process.


Fig. 7 shows the list of distinct words used in the conversation between
Josh and Decoy.
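
Word extraction and distinct word extraction can be sketched as follows. The tokenization rule (lowercase letters, digits and apostrophes) is an assumption for the example, and the two sample messages are invented.

```python
import re


def tokenize(message):
    """Lowercase the message and return its words (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", message.lower())


messages = ["Hi there, how are you?", "I'm fine. How are YOU?"]

# Word extraction: every word in order of appearance.
words = [w for m in messages for w in tokenize(m)]

# Distinct word extraction: each word once, sorted for stable output.
distinct_words = sorted(set(words))
```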
Fig. 7 Distinct Word Extraction

Fig. 8 Negative Words

Fig. 9 shows the frequency of the words in the conversation between Decoy
and Josh. As seen, the words “sexy” and “fucking” are repeated many times in the
conversation. These words fall under the category of negative words, and they
play a vital role in deciding whether the person is a pedophile or not. Based on this
record we can say that Josh is a pedophile and is a threat to Decoy, a.k.a. Erica.
Fig. 9 Frequency of each word in the conversation

The frequency of each distinct word used in the conversation is calculated. The
frequency of words determines the level of pedophilic activity in the
conversation. The set of negative words is developed based on pedophilic
activities on the internet, and the words in the conversation are compared against
this set. The higher the frequency of negative words relative to the frequency of
positive words, the greater the chance that the person is a pedophile. An increase
in negative words indicates that the person is a pedophile and a threat to the
child. This data can be tracked by the police or the cybercrime department, and
restrictions can be placed on such persons and activity.
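
The comparison against the negative word set could look like the sketch below. The `NEGATIVE_WORDS` lexicon is a neutral placeholder, not the set used in this work, and the sample conversation is invented. The sketch compares negative words with all other words rather than with an explicit positive word list, and it reports a ratio without applying any decision threshold, since neither a positive list nor a threshold is specified in the text.

```python
from collections import Counter

# Placeholder lexicon; the actual negative word set is not reproduced here.
NEGATIVE_WORDS = {"secret", "alone", "private"}


def negative_word_profile(words_by_speaker, negative_words=NEGATIVE_WORDS):
    """Count negative and other words per speaker and compute their ratio."""
    profile = {}
    for speaker, words in words_by_speaker.items():
        counts = Counter(words)
        negative = sum(c for w, c in counts.items() if w in negative_words)
        total = sum(counts.values())
        profile[speaker] = {
            "negative": negative,
            "other": total - negative,
            "negative_ratio": negative / total if total else 0.0,
        }
    return profile


# Invented, already tokenized conversation.
conversation = {
    "UserA": ["keep", "it", "secret", "are", "you", "alone", "secret"],
    "UserB": ["ok", "i", "am", "home"],
}
profile = negative_word_profile(conversation)
```

A per-speaker ratio of this kind would at most flag a conversation for human review; how the flag is interpreted depends on the lexicon and on a threshold that would have to be validated separately.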


B. Content Search 

The search engine does not place any restrictions on the search query. The
information available online can be hazardous to many, including children. The
data accessible over the web is not secured and generally involves no privacy
controls or age verification. Data viewed by millions of end users is served
without any involvement of its owner. The end user searches the web for information
that contains data such as images and videos. If the data requested by the user is
inappropriate, then limitations must be placed on such users. For example, if a
person searches for “mountain”, the query does not involve any illegal activity or
negative words. Therefore, that person can be considered safe. But if the person
searches for an image of a naked child, the person is treated as a pedophile and no
response should be returned for the request, as he or she is a threat to society,
especially to children. Such data should not be accessible to the end user; only
the cybercrime department should have the authority to access it. The prediction
made from a search request is not fully accurate, because the query might not
reveal the intention behind the search. But assumptions can be made in order to
protect the child from being attacked by a pedophile. At present, the request of
any end user is approved by the search engine, whereas that should not be the
scenario. Not all requests are acceptable, as some may lead to illegal or criminal
activities. Despite the increase in cybercrime, researchers are leading the way in
finding how such crimes can be decreased or minimized.


The idea is to create a search page that restricts users from searching for content
related to pedophilia.


Fig. 11 shows that a user has requested details of a mountain. This word is not
related to pedophilic activity, so the results are displayed as requested.

Fig. 10 Search Page

Fig. 11 Search Page (1)

Fig. 12 Search Page (2)


Fig. 12 shows that the user has requested details of a “naked child”, and hence the
details will not be given to him/her because there is a risk that the person is a
pedophile. Such restrictions are needed when developing search engine systems.
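
The restricted search behaviour shown in the two figures can be sketched as a query check against a blocklist. The `BLOCKED_PHRASES` entries are neutral placeholders standing in for a curated term list, and `check_query` is a hypothetical helper; how the described search page actually matches queries is not specified in the text.

```python
import re

# Placeholder blocklist; a real system would rely on a curated, maintained list.
BLOCKED_PHRASES = {"blocked phrase", "restricted term"}


def check_query(query, blocked_phrases=BLOCKED_PHRASES):
    """Return whether the query is allowed and which phrase, if any, it matched."""
    # Normalize case, punctuation and repeated spaces before matching.
    normalized = " ".join(re.findall(r"[a-z0-9]+", query.lower()))
    for phrase in blocked_phrases:
        if re.search(r"\b" + re.escape(phrase) + r"\b", normalized):
            return {"allowed": False, "matched": phrase}
    return {"allowed": True, "matched": None}


safe = check_query("Mountain")
flagged = check_query("show me a BLOCKED   phrase now")
```

A query such as “mountain” passes through, while a query containing a listed phrase is refused and could be logged for review. Simple phrase matching cannot capture intent, which is consistent with the caveat above that search request prediction is not fully accurate.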


CONCLUSIONS: 

Creating a safe environment for children is a vital necessity. Through the internet,
children and teenagers have a door to the world. This work examines aspects in
which pedophilic activity is possible and how it can be addressed using chat logs
and search engines. The chat system is proposed as a way to identify a pedophile.
This may be used by the police or the cybercrime department to track the pedophile
and stop such activities, keeping children safe from predators.


A model is also proposed for search engines that restricts the usage and the
request queries of the user. The engine does not allow the user to obtain data
that the model judges might lead to pedophilic activity, thus creating a safer
internet space for children.


The chat model can be further extended to analyze sentiment. Based on the
sentiment, the level of pedophilic behavior can be estimated or predicted. The
constraint of multilingualism can be addressed and more accurate results
obtained.


The search model leads to a restricted search. This restriction can be extended to
various terms and areas, thereby limiting searches related to many illegal or
criminal activities. A general model can be used with any search engine, thus
leading towards a restricted internet. The police can keep track of predators who
try to search for items related only to pedophilic activity. Such pedophiles can be
tracked down through IP addresses.


The proposed work is limited to two aspects of the internet. There are various
other aspects that need to be made safe for children. Online video calling does
not prevent a pedophile from socializing with children. Due to a lack of parental
attention, children fall into the trap of pedophiles and are sometimes even
recorded.